TTS Model Evaluation for Kira

Comparing commercial and open-source TTS models across quality, latency, and cost

Generated: 2026-01-28 14:24

Commercial TTS Models

Commercial Models - Performance Overview

Azure, ElevenLabs, and MiniMax - paid API services with enterprise support

Provider Avg Latency Realtime Factor Languages
Azure TTS 262ms 25.2x 140
Azure Streaming 211ms 32.7x 140
ElevenLabs Standard 1124ms 4.7x 29
ElevenLabs Turbo 239ms 18.4x 29
MiniMax 4550ms 1.4x 11
MiniMax Streaming 1601ms 4.5x 11
MiniMax PCM 1813ms 4.0x 11

Commercial Models - Audio Comparison

Voice quality was compared across commercial providers using five test phrases; the measured latency per phrase is summarized below.

Test phrases:

  • basic: "Hello, how can I help you today?"
  • complex: "Dr. Smith's API at api.example.com returns JSON for the Q4 NYSE report."
  • numbers: "Your balance is $12,847.53, payment of $299.99 due January 15th, 2026."
  • chinese_mix: "欢迎使用Kira智能助手,我是您的AI客服,请问有什么可以帮您?" ("Welcome to the Kira smart assistant, I am your AI customer service agent. How may I help you?")
  • english_mix: "Welcome to Kira智能助手, your AI客服 for 24/7 support."

Per-phrase latency (ms):

Provider             basic  complex  numbers  chinese_mix  english_mix
Azure TTS              250      289      281          243          245
Azure Streaming        207      240      226          187          197
ElevenLabs Standard    833     1216     1318         1191         1062
ElevenLabs Turbo       161      267      304          240          225
MiniMax               3647     5202     5495         4279         4128
MiniMax Streaming     1415     1608     1582         1760         1640
MiniMax PCM           1901     1792     1804         1765         1801

Commercial Models - Cost Projection

Provider 100K chars/mo 500K chars/mo 1M chars/mo Model
Azure TTS $1.60 $8.00 $16.00 en-US-AvaMultilingualNeural, non-streaming
Azure Streaming $1.60 $8.00 $16.00 en-US-AvaMultilingualNeural, streaming
ElevenLabs Standard $16.50 $82.50 $165.00 eleven_multilingual_v2, high quality
ElevenLabs Turbo $16.50 $82.50 $165.00 eleven_turbo_v2_5, low latency
MiniMax $6.00 $30.00 $60.00 speech-2.6-turbo, non-streaming
MiniMax Streaming $6.00 $30.00 $60.00 speech-2.6-turbo, streaming MP3
MiniMax PCM $6.00 $30.00 $60.00 speech-2.6-turbo, streaming PCM

Audio Formats & Connection Compatibility

Audio Format Details by Provider

Output format comparison for WebSocket/WebRTC integration

Provider Mode Format Sample Rate Bit Depth Channels
Azure TTS Non-Streaming RIFF/WAV PCM 24 kHz 16-bit Mono
Azure Streaming Streaming RIFF/WAV PCM 24 kHz 16-bit Mono
ElevenLabs Standard Non-Streaming Raw PCM 24 kHz 16-bit Mono
ElevenLabs Turbo Streaming Raw PCM 24 kHz 16-bit Mono
MiniMax Non-Streaming WAV 32 kHz 16-bit Mono
MiniMax Streaming Streaming (MP3) MP3 (128kbps) 32 kHz Compressed
MiniMax PCM Streaming (PCM) Raw PCM 24 kHz 16-bit Mono
Qwen3-TTS Both Raw PCM 24 kHz 16-bit Mono
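
Several rows above are headerless raw PCM (ElevenLabs, MiniMax PCM, Qwen3-TTS). For spot-checking samples outside the pipeline, the raw bytes can be wrapped in a WAV container; this minimal sketch uses only the Python standard library and assumes the 24 kHz, 16-bit, mono parameters listed in the table.

```python
import wave

def pcm_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> None:
    """Wrap headerless 16-bit mono PCM in a WAV container so normal players can open it."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono, per the table above
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

# Example: pcm_to_wav(response_bytes, "sample.wav", sample_rate=24000)
```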

MiniMax Format Comparison

MiniMax Streaming (MP3): Uses lossy MP3 compression. Requires MP3 decoding before WebRTC transmission, adding latency and complexity.

MiniMax PCM: Raw PCM output at 24kHz - directly compatible with WebRTC/WebSocket. No transcoding needed.
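
To make the transcoding cost concrete, here is a minimal sketch that decodes a complete MP3 buffer to 16-bit mono PCM using the third-party pydub package (which needs ffmpeg installed). It is illustrative only: decoding a live MP3 stream chunk by chunk also requires frame-boundary handling, which is exactly the complexity that raw PCM output avoids.

```python
import io
from pydub import AudioSegment  # third-party: pip install pydub (requires ffmpeg)

def mp3_to_pcm(mp3_bytes: bytes, target_rate: int = 24000) -> bytes:
    """Decode a complete MP3 buffer into raw 16-bit mono PCM at the target sample rate."""
    segment = AudioSegment.from_file(io.BytesIO(mp3_bytes), format="mp3")
    segment = segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    return segment.raw_data  # little-endian 16-bit mono PCM, ready for WebRTC/WebSocket
```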

WebSocket/WebRTC Compatibility

WebRTC Requirements

Status: Testing requirement

  • Natively supports Opus and PCM codecs
  • MP3 is not a native WebRTC codec
  • MP3 → PCM decoding adds latency and complexity
  • Recommended: Use PCM-outputting providers (Azure, ElevenLabs, MiniMax PCM)

WebSocket Requirements

Status: Production use

  • Can transport any binary format (MP3, WAV, PCM)
  • For real-time playback, raw PCM is preferred (a minimal forwarding sketch follows this list)
  • No decoding overhead = lower client-side latency
  • MiniMax Streaming (MP3) works but requires client-side decode
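
A minimal sketch of the preferred pattern, assuming raw PCM from the provider: chunks are forwarded to the client over a WebSocket with no transcoding. The synthesize_stream generator below is a stand-in for a provider SDK call (not a real API), and the server uses a recent version of the third-party websockets package.

```python
import asyncio
import websockets  # third-party: pip install websockets

async def synthesize_stream(text: str):
    """Stand-in for a provider SDK call that yields raw 16-bit mono PCM chunks."""
    for _ in range(10):
        yield b"\x00\x00" * 2400     # 100 ms of silence at 24 kHz

async def handle_client(ws):
    text = await ws.recv()           # client sends the text to speak
    async for pcm_chunk in synthesize_stream(text):
        await ws.send(pcm_chunk)     # raw PCM passes through with no transcoding
    await ws.send(b"")               # empty frame marks end of utterance

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()       # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```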

Recommendation for Kira

  • Lowest Latency (EN): Azure Streaming (~210ms) or ElevenLabs Turbo (~240ms) - both output PCM, WebRTC compatible
  • Best Practice: Use PCM-outputting providers (Azure, ElevenLabs, MiniMax PCM) for both WebSocket and WebRTC compatibility

Analysis & Recommendations

Understanding the Metrics

Latency (ms)

Time to generate audio from text. Lower is better. Under 500ms feels instant, 500-1500ms is acceptable, over 1500ms feels slow.

Realtime Factor

How much faster than playback speed. Example: 15x means a 3-second clip generates in 0.2 seconds. Need at least 1x for real-time apps.
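
The relationship is simply audio duration divided by generation time, as in this quick check of the 15x example:

```python
# Realtime factor = synthesized audio duration / time taken to generate it.
audio_duration_s = 3.0     # length of the clip
generation_time_s = 0.2    # wall-clock synthesis time
rtf = audio_duration_s / generation_time_s
print(f"{rtf:.0f}x realtime")  # -> 15x; anything >= 1x can keep up with live playback
```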

Cost (per 1M chars)

Price per 1 million characters. For scale, 1M characters is roughly 250 pages of text, or well over 15 hours of synthesized speech at a typical speaking rate.
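
The projections in the cost tables follow directly from this rate; a minimal calculation using the rates quoted above:

```python
# Reproduce the cost projections: monthly characters x (USD rate per 1M characters).
rates_per_million = {
    "Azure TTS": 16.00,
    "ElevenLabs": 165.00,
    "MiniMax (international)": 60.00,
    "Qwen3-TTS": 10.00,
}

def monthly_cost(chars_per_month: int, rate_per_million_usd: float) -> float:
    return chars_per_month / 1_000_000 * rate_per_million_usd

for provider, rate in rates_per_million.items():
    costs = [monthly_cost(n, rate) for n in (100_000, 500_000, 1_000_000)]
    print(provider, [f"${c:.2f}" for c in costs])
```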

Languages Supported

Total number of languages the provider supports. More languages means broader global coverage.

Streaming vs Non-Streaming

Key Insight: Streaming is a delivery optimization, not a quality setting. The audio quality is identical - only the timing differs.

Note: Audio durations may vary slightly between runs because TTS synthesis is non-deterministic and each API call generates audio independently.

Same TTS Model

Both modes use the exact same voice synthesis model. The audio generation algorithm is identical.

Different Delivery

Non-streaming: Wait for complete audio, return all at once.
Streaming: Return audio in chunks as it's generated.
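
The difference is easiest to see as two call shapes. The stubs below are hypothetical (no real SDK is used); they only model when the first audio bytes become available to the player.

```python
import time

def synthesize(text: str) -> bytes:
    """Stub for a non-streaming call: the full clip is returned only when synthesis ends."""
    time.sleep(0.26)                 # pretend total generation takes ~260 ms
    return b"\x00" * 48_000          # 1 s of 24 kHz 16-bit mono silence

def synthesize_stream(text: str):
    """Stub for a streaming call: chunks are yielded while synthesis is still running."""
    for _ in range(10):
        time.sleep(0.026)            # each chunk arrives as it is generated
        yield b"\x00" * 4_800        # 100 ms of audio per chunk

start = time.perf_counter()
audio = synthesize("Hello, how can I help you today?")
print(f"non-streaming: playback can start after {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
for i, chunk in enumerate(synthesize_stream("Hello, how can I help you today?")):
    if i == 0:
        print(f"streaming: playback can start after {time.perf_counter() - start:.2f}s")
```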

Latency Benefit

Streaming lets audio playback start sooner (e.g., 180ms vs 260ms), reducing perceived wait time.

Analogy

Like streaming vs downloading a video - the quality is identical, you just start watching sooner with streaming.

Why Audio Lengths May Differ

Non-Deterministic Synthesis

TTS models don't produce identical output every time. Each API call generates audio independently with slight variations in pacing.

Different Models (ElevenLabs)

ElevenLabs Standard vs Turbo use completely different models with different speaking rates, causing noticeable duration differences.

Expected Behavior

Small duration differences (±10%) between runs are normal. The content and quality remain consistent.

Why Latency Varies by Provider

Azure (~260ms non-streaming / ~210ms streaming)

Why only ~20% difference? Azure is already highly optimized with global infrastructure.

Global CDN

Azure operates 60+ regions serving 140+ countries. Requests are handled by the nearest data center, minimizing network latency.

Already Fast

Base latency is already low (~260ms). Streaming improves by ~50ms - noticeable but not dramatic.

Short Text Samples

For short utterances, total generation time is minimal. Streaming benefit increases with longer text.

Streaming Benefit is Relative

Compare: MiniMax streaming saves 65%, ElevenLabs Turbo saves 79%. Azure saves 20% because it's already optimized.

ElevenLabs (Standard ~1100ms / Turbo ~250ms)

Two Models Available:

  • Standard (eleven_multilingual_v2): ~1100ms - Best quality, higher latency
  • Turbo (eleven_turbo_v2_5): ~250ms - Near-Azure speed, slightly lower quality

Standard Model (~1100ms)

Premium quality with advanced neural models. Best for pre-generated content where latency isn't critical.

Turbo Model (~250ms)

Optimized for real-time streaming. Quality is still excellent - most users won't notice the difference.

Regional Endpoints

api.elevenlabs.io → US only
api-global-preview.elevenlabs.io → Auto-routes to closest (US/EU/Singapore)
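
A hedged sketch of how the endpoint choice becomes a one-line configuration change. The request path, header, and payload here are assumptions about the ElevenLabs REST API rather than a verified contract; check the official docs for the exact request shape.

```python
import os
import requests

BASE_URL = os.environ.get(
    "ELEVENLABS_BASE_URL",
    "https://api-global-preview.elevenlabs.io",  # auto-routes to the closest region
)

def synthesize(text: str, voice_id: str) -> bytes:
    """Assumed request shape for the ElevenLabs text-to-speech endpoint (see docs)."""
    resp = requests.post(
        f"{BASE_URL}/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_turbo_v2_5"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # audio bytes in the requested output format
```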

Recommendation

Use Turbo for real-time applications. Use Standard only when quality is paramount and latency is acceptable.

MiniMax (~4550ms)

Key Factor: MiniMax is a Chinese company. We tested using api.minimax.io (international endpoint), which still routes to China servers.

Note: MiniMax also offers api.minimax.chat (China domestic endpoint) which may have lower latency for users in China.

No Overseas Servers

Both endpoints (api.minimax.io and api.minimax.chat) route to servers in mainland China. MiniMax has no data centers outside China.

International Endpoint ≠ International Servers

The .io domain is just a gateway for overseas access. Requests still travel to China and back, adding 200-400ms+ network latency.

No Global CDN

Unlike Azure (60+ global regions) or ElevenLabs (US/EU servers), MiniMax lacks distributed infrastructure.

Streaming Helps

~4550ms → ~1600ms TTFB (65% faster). Always use streaming mode for MiniMax to reduce perceived wait time.

MiniMax Considerations

  • High latency (International): ~1600-1970ms TTFB from overseas - not ideal for real-time outside China
  • Recommended for Chinese: Native Chinese language support, ideal for Chinese content and users in China
  • China pricing (Recommended): speech-2.6-turbo costs 2 CNY/10K chars (~200 CNY/1M ≈ $28 USD) - much cheaper than the international $60/1M (quick check below)
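
A quick sanity check of that China-domestic figure (the ~7.1 CNY/USD exchange rate is an assumption):

```python
# Assumed FX rate: ~7.1 CNY per USD.
cny_per_10k_chars = 2
chars = 1_000_000
cny_total = cny_per_10k_chars * chars / 10_000   # 200 CNY per 1M characters
usd_total = cny_total / 7.1                      # ~ $28 USD
print(f"{cny_total:.0f} CNY ≈ ${usd_total:.2f} per 1M chars")
```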

Recommendations (Commercial TTS)

Key Findings

  • Best for Real-time (EN): Azure Streaming (~210ms) or ElevenLabs Turbo (~240ms) - both excellent for low-latency applications
  • Best Quality: ElevenLabs Standard - most natural sounding, but higher latency (~1100ms) and cost ($165/1M)
  • Best for Chinese Users: MiniMax - native Chinese support, affordable in China (2 CNY/10K chars ≈ $28/1M USD)
  • Most Languages: Azure TTS - supports 140 languages with consistent quality

Azure

Best overall: lowest latency (~210ms), 140 languages, $16/1M. Ideal for real-time global applications.

ElevenLabs

Best quality: use Turbo (~240ms) for real-time, Standard (~1100ms) for premium quality. $165/1M.

MiniMax

Best for Chinese: native Mandarin support. High latency internationally (~1600ms streaming); recommended for users in China.

Open-Source TTS Models

Open-Source Models - Performance Overview

Qwen3-TTS and LuxTTS - free/low-cost alternatives with self-hosting options

Provider Avg Latency Realtime Factor Languages
No data available

Open-Source Models - Audio Comparison

Listen and compare voice quality across open-source providers:

No audio samples available

Open-Source Models - Cost Projection

Provider 100K chars/mo 500K chars/mo 1M chars/mo Model
Qwen3-TTS $1.00 $5.00 $10.00 qwen3-tts-flash, non-streaming
Qwen3-TTS Streaming $1.00 $5.00 $10.00 qwen3-tts-flash-realtime, streaming
LuxTTS $0.00 $0.00 $0.00 Local CPU - no API cost

Open-Source Advantage

  • Qwen3-TTS: Uses DashScope API (~$10/1M chars) or can be self-hosted for free
  • LuxTTS: Runs locally on CPU - zero API cost, only compute resources
  • No vendor lock-in: Full control over models and data